The dataset selected contains data relating to salaries for jobs in Artificial Intelligence. Much consideration was given to dataset selection; however, this dataset is personally interesting and far less morbid than others considered, such as the Scottish Government's datasets on road casualties, healthcare, and poverty.
The dataset is a suitable size for this project. I experimented with some larger datasets such as that on road casualties in the UK (found here: https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data) where there are several files that could make for some really interesting analysis and visualisation. However, this resulted in Jupyter Notebooks slowing down significantly. I have therefore opted for a smaller dataset that is still interesting and also more relevant to my degree. It will allow for fruitful analysis as it contains several columns that take a mixture of qualitative and quantitative data. This will allow me to ask various questions of the dataset, such as:
These questions will provide a deeper insight into the dataset. They may reveal a trend over time, or they may reveal surprising relationships. I would expect, for example, salaries to have been on a steady incline over recent years given the rapid advancement of the field. It is more difficult to predict, without exploratory analysis, where the majority of AI jobs are located.
The link to the selected dataset is: https://github.com/plotly/datasets/blob/master/salaries-ai-jobs-net.csv
The dataset is accessed through the url provided above. There is only one url, which links to one table with several rows and columns. The dataset is loaded into a Pandas DataFrame below. Then, the number of rows and columns the dataset contains is displayed, along with the first 5 rows of the dataset.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/salaries-ai-jobs-net.csv')
print("Salaries dataset includes", df.shape[0], "rows and", df.shape[1], "columns")
Salaries dataset includes 637 rows and 11 columns
df.head()
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | MI | FT | Data Analyst | 90000 | SGD | 65950 | SG | 50 | SG | M |
| 1 | 2022 | MI | FT | AI Scientist | 200000 | USD | 200000 | US | 100 | US | M |
| 2 | 2022 | EN | FT | Machine Learning Developer | 180000 | USD | 180000 | US | 100 | US | L |
| 3 | 2022 | MI | FT | Data Scientist | 153000 | USD | 153000 | US | 100 | US | L |
| 4 | 2022 | SE | FT | Data Engineer | 210000 | USD | 210000 | US | 100 | US | M |
From the above we can see that each row represents a single employee. There are 11 potential variables to be used in analysis.
We can inspect the data types of each feature through the info() function, which displays each column along with its number of non-null values and corresponding dtype. We can also make use of the isnull() function to check for missing data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   work_year           637 non-null    int64
 1   experience_level    637 non-null    object
 2   employment_type     637 non-null    object
 3   job_title           637 non-null    object
 4   salary              637 non-null    int64
 5   salary_currency     637 non-null    object
 6   salary_in_usd       637 non-null    int64
 7   employee_residence  637 non-null    object
 8   remote_ratio        637 non-null    int64
 9   company_location    637 non-null    object
 10  company_size        637 non-null    object
dtypes: int64(4), object(7)
memory usage: 54.9+ KB
df.isnull().sum()
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
Evidently, the dataset does not contain any missing values, so no action is needed in this regard. If the dataset included missing values, a decision would be made as to whether the respective rows should be removed or whether a value should be imputed. This decision depends on the context and there is no hard and fast rule; for instance, a sentinel value (such as -1) could be inserted into fields with missing values in place of the null.
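Although no imputation is needed here, both options can be sketched on a hypothetical mini-frame (the column names below are illustrative only, not taken from this dataset's cleaning code):

```python
import pandas as pd

# Hypothetical mini-frame with one missing salary
toy = pd.DataFrame({'job_title': ['Data Analyst', 'AI Scientist'],
                    'salary': [90000, None]})

dropped = toy.dropna(subset=['salary'])   # option 1: remove rows with missing values
imputed = toy.fillna({'salary': -1})      # option 2: impute a sentinel value such as -1

print(len(dropped))                # 1
print(imputed['salary'].tolist())  # [90000.0, -1.0]
```

Which option is preferable depends on how much data would be lost by dropping and on whether a sentinel could distort downstream statistics.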
From the information above, I will assume the 'salary' column relates to an employee's annual salary. It would be helpful to see an employee's monthly salary at a glance too. To add this, we can derive a monthly figure from the annual salary and store the result in a new feature.
df['monthly_salary'] = df['salary']/12
df['monthly_salary'] = df['monthly_salary'].astype('int64') #store as integers
Another part of data cleaning and transformation is standardising the data to ensure the data is prepared for analysis. We must first check what values a column contains. This is achieved through the unique() function.
def displayUniqueValues():
    for col in df.columns:
        print(col, df[col].unique(), "\n")
displayUniqueValues()
work_year [2022 2020 2021]
experience_level ['MI' 'EN' 'SE' 'EX']
employment_type ['FT' 'PT' 'CT' 'FL']
job_title ['Data Analyst' 'AI Scientist' 'Machine Learning Developer'
'Data Scientist' 'Data Engineer' 'Machine Learning Scientist'
'Machine Learning Engineer' 'Data Science Manager' 'ML Engineer'
'Data Analytics Manager' 'ETL Developer' 'Lead Data Engineer'
'Data Architect' 'Head of Machine Learning' 'Data Science Engineer'
'Head of Data Science' 'Analytics Engineer' 'Machine Learning Manager'
'Director of Data Science' 'NLP Engineer' 'Business Data Analyst'
'Machine Learning Infrastructure Engineer'
'Applied Machine Learning Scientist' 'Applied Data Scientist'
'Computer Vision Engineer' 'Head of Data' 'Research Scientist'
'Principal Data Analyst' 'Product Data Analyst' 'Data Analytics Lead'
'Data Engineering Manager' 'Data Analytics Engineer'
'Principal Data Scientist' 'Computer Vision Software Engineer'
'Financial Data Analyst' 'Lead Data Scientist'
'Lead Machine Learning Engineer' 'Principal Data Engineer'
'Big Data Engineer' 'BI Data Analyst' 'Data Science Consultant'
'3D Computer Vision Researcher' 'Lead Data Analyst'
'Marketing Data Analyst' 'Director of Data Engineering'
'Cloud Data Engineer' 'Big Data Architect' 'Staff Data Scientist'
'Finance Data Analyst' 'Data Specialist']
salary [ 90000 200000 180000 153000 210000 100000 150075 110925
22800 160000 92000 202900 131300 20000 15000 175000
135000 193000 83000 75000 55000 186000 148800 112900
90320 240000 300000 62500 95000 120000 145000 105400
43200 215300 158200 209100 154600 115934 81666 155000
80000 164000 132000 170000 123000 189650 164996 50000
150000 165400 132320 208775 147800 136994 101570 128875
93700 6000000 28500 183600 100800 40000 30000 70000
60000 140400 45000 260000 35000 82900 63900 112300
241000 159000 58000 136000 108800 242000 165220 120160
124190 181940 220110 160080 126500 106260 116000 99000
120600 130000 102100 84900 136620 99360 146000 110000
161342 137141 167000 211500 138600 192400 90700 61300
113000 95550 115500 243900 156600 136600 109280 224000
167875 205300 176000 144000 200100 70500 54000 184700
175100 140250 116150 99050 85000 214000 192600 266400
213120 115000 141300 206699 99100 110500 192564 144854
230000 150260 67000 52000 154000 126000 129000 140000
69000 25000 105000 220000 65000 324000 216000 185100
104890 157000 250000 1400000 2400000 53000 109000 88000
10000 87000 66500 78000 121000 57000 48000 165000
29000 69999 52800 59000 152500 405000 380000 177000
62000 8500000 7000000 148000 24000 38400 82500 42000
3000000 125000 700000 8760 51999 450000 41000 159500
13400 103000 12000 400000 270000 68000 138000 45760
235000 225000 44000 2250000 37456 106000 11000000 14000
81000 2200000 276000 188000 174000 93000 2100000 51400
61500 720000 108000 31000 52500 91000 1600000 256000
72500 185000 112000 65720 72000 111775 93150 21600
4900000 1250000 190000 1200000 21000 4000000 1799997 9272
147000 120500 21844 4000 22000 76760 1672000 420000
30400000 195000 32000 416000 40900 2500000 8000 4450000
423000 56000 299000 98000 325000 34000 600000 69600
435000 37000 19000 74000 152000 18000 102000 39600
1335000 1450000 73000 190200 118000 138350 130800 168000
412000 151000]
salary_currency ['SGD' 'USD' 'EUR' 'AUD' 'GBP' 'CAD' 'INR' 'CNY' 'PLN' 'CHF' 'JPY' 'HUF'
'MXN' 'TRY' 'CLP' 'DKK' 'BRL']
salary_in_usd [ 65950 200000 180000 153000 210000 100000 150075 110925 22800 160000
92000 202900 131300 22809 15000 175000 135000 138851 83000 97341
71383 186000 148800 112900 90320 240000 300000 68324 123298 120000
145000 105400 49268 116809 215300 158200 209100 154600 115934 81666
155000 87454 164000 132000 170000 123000 189650 164996 54659 117990
165400 132320 208775 147800 136994 101570 128875 93700 78747 36989
183600 100800 51915 38936 43727 32795 76522 103830 90851 77873
140400 65591 49193 260000 45425 58404 51064 60000 82900 63900
112300 241000 159000 80000 58000 136000 108800 242000 64894 165220
120160 124190 181940 220110 160080 126500 106260 116000 99000 120600
130000 90000 150000 102100 84900 136620 99360 146000 110000 161342
137141 167000 211500 138600 192400 90700 61300 113000 95550 115500
243900 156600 136600 109280 224000 167875 205300 176000 144000 200100
70500 54000 184700 175100 140250 116150 99050 85000 75000 214000
192600 266400 213120 115000 141300 206699 99100 110500 192564 144854
230000 150260 67000 52000 154000 126000 129000 140000 69000 25000
105000 50000 220000 181703 65000 324000 216000 185100 104890 78660
117104 196650 37047 18374 31498 57938 157000 70794 71056 109000
69221 10000 20000 102839 52309 78000 87052 40000 62311 54742
92920 86332 165000 35372 31702 69999 57720 64497 152500 405000
380000 121787 177000 67777 48000 77364 63711 161791 24000 38400
82500 49646 40570 125000 9466 10354 110037 21863 82744 59303
62649 82528 55000 250000 70000 130026 63831 68428 450000 46759
74130 127221 13400 75774 103000 12000 5409 270000 54238 47282
153667 28476 59102 138000 79197 45760 53192 235000 79833 225000
76833 50180 88654 103160 113476 94564 30428 187442 51519 106000
112872 36259 15966 95746 76958 89294 29751 276000 188000 174000
93000 28399 60757 70139 6072 33511 96282 12103 36643 72212
91000 99703 103691 21637 42000 63810 109024 256000 72500 185000
69741 112000 20171 77684 72000 65013 28016 111775 93150 25532
66265 16904 190000 141846 16228 71786 35735 24823 54094 24342
9272 147000 96113 21844 51321 40481 4000 39916 87000 26005
90734 22611 5679 81000 40038 2859 61467 195000 37825 416000
56256 33808 116914 46597 8000 41689 114047 5707 56000 28609
43331 47899 98000 66022 56738 325000 45896 40189 600000 12901
5882 42197 62726 21669 87738 61896 74000 152000 18000 18907
173762 148261 38776 46809 18053 91237 19609 62000 73000 45391
190200 118000 138350 130800 45618 168000 119059 423000 28369 412000
151000 94665]
employee_residence ['SG' 'US' 'EG' 'PT' 'ID' 'AU' 'GB' 'DE' 'IN' 'FR' 'GR' 'CA' 'ES' 'IT'
'AR' 'AE' 'BO' 'IE' 'SI' 'MY' 'JP' 'EE' 'NL' 'PK' 'BR' 'PL' 'HN' 'TN'
'CZ' 'AT' 'CH' 'RU' 'DZ' 'VN' 'IQ' 'BE' 'UA' 'NG' 'BG' 'PH' 'HU' 'MX'
'TR' 'JE' 'PR' 'RS' 'KE' 'CO' 'NZ' 'IR' 'RO' 'CL' 'DK' 'CN' 'HK' 'MD'
'LU' 'HR' 'MT']
remote_ratio [ 50 100 0]
company_location ['SG' 'US' 'EG' 'PT' 'ID' 'AU' 'GB' 'DE' 'GR' 'CA' 'IN' 'ES' 'IT' 'MX'
'AE' 'IE' 'LU' 'SI' 'MY' 'EE' 'NL' 'FR' 'PL' 'HN' 'CZ' 'AT' 'CH' 'PK'
'JP' 'DZ' 'BR' 'RO' 'IQ' 'BE' 'RU' 'UA' 'NG' 'DK' 'TR' 'CN' 'HU' 'KE'
'CO' 'NZ' 'IR' 'CL' 'MD' 'VN' 'AS' 'HR' 'IL' 'MT']
company_size ['M' 'L' 'S']
monthly_salary [ 7500 16666 15000 12750 17500 8333 12506 9243 1900
13333 7666 16908 10941 1666 1250 14583 11250 16083
6916 6250 4583 15500 12400 9408 7526 20000 25000
5208 7916 10000 12083 8783 3600 17941 13183 17425
12883 9661 6805 12916 6666 13666 11000 14166 10250
15804 13749 4166 12500 13783 11026 17397 12316 11416
8464 10739 7808 500000 2375 15300 8400 3333 2500
5833 5000 11700 3750 21666 2916 6908 5325 9358
20083 13250 4833 11333 9066 20166 13768 10013 10349
15161 18342 13340 10541 8855 9666 8250 10050 10833
8508 7075 11385 8280 12166 9166 13445 11428 13916
17625 11550 16033 7558 5108 9416 7962 9625 20325
13050 11383 9106 18666 13989 17108 14666 12000 16675
5875 4500 15391 14591 11687 9679 8254 7083 17833
16050 22200 17760 9583 11775 17224 8258 9208 16047
12071 19166 12521 5583 4333 12833 10500 10750 11666
5750 2083 8750 18333 5416 27000 18000 15425 8740
13083 20833 116666 200000 4416 9083 7333 833 7250
5541 6500 10083 4750 4000 13750 2416 4400 4916
12708 33750 31666 14750 5166 708333 583333 12333 2000
3200 6875 3500 250000 10416 58333 730 37500 3416
13291 1116 8583 1000 33333 22500 5666 11500 3813
19583 18750 3666 187500 3121 8833 916666 1166 6750
183333 23000 15666 14500 7750 175000 4283 5125 60000
9000 2583 4375 7583 133333 21333 6041 15416 9333
5476 6000 9314 7762 1800 408333 104166 15833 100000
1750 333333 149999 772 12250 10041 1820 333 1833
6396 139333 35000 2533333 16250 2666 34666 3408 208333
666 370833 35250 4666 24916 8166 27083 2833 50000
5800 36250 3083 1583 6166 12666 1500 8500 3300
111250 120833 6083 15850 9833 11529 10900 14000 34333
12583]
From the information above, there does not appear to be any bad data in the dataset. The values in the dataset for each column appear to conform to a common standard. If we consider 'company_location', for example, we can see the standard is the abbreviated location. If a value in this column appeared as "Great Britain", this is where the value would be replaced with "GB" in order to standardise the data. The only point to note is that there are two very similar job titles that likely relate to the same thing. The title 'ML Engineer' is probably the same job as 'Machine Learning Engineer'.
What is notable from the above information, though, is that there is no obvious index column. A unique identifier might be a single attribute, or it might be a combination of attributes and relationships. It is important to be able to identify a single entity, like an employee, accurately. This is crucial in databases because it enables users to locate one unique record among many. If an employee leaves the organisation, for example, we want to maintain the integrity of the data by removing, or hiding, the correct employee so that the ex-employee is not confused with someone else.
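The absence of a natural key can be demonstrated with pandas' duplicated(): if two rows' attributes can coincide, no combination of existing columns identifies an employee uniquely. A toy illustration on hypothetical values:

```python
import pandas as pd

# Two distinct employees whose recorded attributes happen to coincide
toy = pd.DataFrame({'job_title': ['Data Scientist', 'Data Scientist'],
                    'salary': [100000, 100000]})
print(toy.duplicated().sum())  # 1 -> indistinguishable without an identifier

# An artificial identifier restores uniqueness
toy.insert(0, 'employee_id', range(1, 1 + len(toy)))
print(toy.duplicated().sum())  # 0
```

This is exactly the motivation for the artificial 'employee_id' created below.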
For this dataset, I will create an artificial unique identifier for the sole purpose of being able to identify a single employee. I will also replace 'ML Engineer' with 'Machine Learning Engineer'.
df.insert(0, 'employee_id', range(1, 1+len(df)))
df['job_title'] = df['job_title'].replace({'ML Engineer':'Machine Learning Engineer'})
To summarise this section, we have:
With the dataset in good condition to move forward, the next task is to conduct Exploratory Data Analysis (EDA).
In performing EDA, we can gain a better understanding of the dataset. One important aspect of EDA is understanding the distribution of variables; that is, the probability of a variable taking on a range of values. This can prevent problems arising because it establishes if the dataset contains outliers, large majority values, and flat and wide values.
A dataset that contains such occurrences can result in misleading information. For example, an outlier is a value which is very rare. This could indicate a data entry error. The rare value might be true - it simply depends on the context; however, care and consideration should be given to such values.
Furthermore, a large majority value could falsely suggest there is a lot of data. In fact, large majority values all relate to the same thing, hence falsely suggesting a large volume of data.
In addition, care should be taken where variables are flat and wide. These are minorities in the dataset where there are very few occurrences of each value. An example is an identifier variable. It is certain these values will occur only once or twice in the dataset; they will not be of much use when trying to detect relationships and trends, for example. In this project, an example is the newly created 'employee_id' field, because it is used to identify each single employee; each employee has a unique index number that will obviously only occur once.
This EDA will explore the dataset further to answer, in part or in full, the questions outlined at the start of this notebook. The EDA may lead to more questions that add depth to the basic analysis. Prior to visualising distributions, some summary statistics are described below.
The function describe() is very helpful in calculating some descriptive statistics of the dataset. Below we can see various statistics such as the mean of each numerical column, along with the minimum and maximum value. The minimum (earliest) year value is 2020 and the maximum is 2022, suggesting the dataset covers 2020 to 2022. We can also see that the lowest salary is 2,859 USD; the highest being 600,000 USD; and the average is 113,275 USD.
df.describe()
| | employee_id | work_year | salary | salary_in_usd | remote_ratio | monthly_salary |
|---|---|---|---|---|---|---|
| count | 637.000000 | 637.000000 | 6.370000e+02 | 637.000000 | 637.000000 | 6.370000e+02 |
| mean | 319.000000 | 2021.430141 | 3.151061e+05 | 113275.439560 | 70.879121 | 2.625854e+04 |
| std | 184.030342 | 0.689250 | 1.508096e+06 | 70874.620746 | 40.869244 | 1.256747e+05 |
| min | 1.000000 | 2020.000000 | 4.000000e+03 | 2859.000000 | 0.000000 | 3.330000e+02 |
| 25% | 160.000000 | 2021.000000 | 7.000000e+04 | 63831.000000 | 50.000000 | 5.833000e+03 |
| 50% | 319.000000 | 2022.000000 | 1.150000e+05 | 103000.000000 | 100.000000 | 9.583000e+03 |
| 75% | 478.000000 | 2022.000000 | 1.650000e+05 | 150075.000000 | 100.000000 | 1.375000e+04 |
| max | 637.000000 | 2022.000000 | 3.040000e+07 | 600000.000000 | 100.000000 | 2.533333e+06 |
To visualise categorical variables, we can display the number of occurrences of each categorical value using seaborn's countplot(). Simply put, this visualisation "can be thought of as a histogram across a categorical, instead of quantitative, variable" [2]. I will utilise this to visualise the distribution of categorical variables.
To visualise numerical variables, it is appropriate to visualise distribution through a histogram, similar to a bar plot, where the axis that represents the variable in question is divided into a number of bins. Each bin counts the number of occurrences of values falling within it - the more frequent the occurrences, the taller the bar. Using seaborn, a histogram can be generated through displot() or histplot(). [3]
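The binning step that histplot() performs before drawing can be sketched without a plotting backend, here with pd.cut on a handful of hypothetical salaries:

```python
import pandas as pd

# Hypothetical salaries for illustration only
salaries = pd.Series([40000, 55000, 60000, 90000, 95000, 200000])

# Divide the salary axis into 4 equal-width bins and count occurrences per bin;
# a histogram then draws one bar per bin, taller where the count is higher
counts = pd.cut(salaries, bins=4).value_counts().sort_index()
print(counts.tolist())  # [3, 2, 0, 1]
```

The empty third bin shows why bin width matters: too many bins leave gaps, too few hide structure.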
It should be noted that there is no data guide or data dictionary. Some assumptions are made with regard to what some values represent and these are made clear where applicable.
#required imports
import seaborn as sns
import matplotlib.pyplot as plt
df['work_year'].value_counts()
2022    347
2021    217
2020     73
Name: work_year, dtype: int64
sns.countplot(df, x='work_year', palette='RdPu')
<Axes: xlabel='work_year', ylabel='count'>
We can see from the above that the distribution of 'work_year' is not entirely imbalanced. The dataset mostly contains employees' work records from 2022, with slightly fewer records from 2021 and significantly fewer from 2020.
df['experience_level'].value_counts()
SE    299
MI    220
EN     92
EX     26
Name: experience_level, dtype: int64
sns.countplot(df, x='experience_level', palette='RdPu')
<Axes: xlabel='experience_level', ylabel='count'>
Here I will assume:
We can see from the above chart that most employees in the dataset are Senior, with the next most common experience_level being "Middle". We then have a lower proportion of Entry level employees, and an even lower proportion of Experienced employees.
df['employment_type'].value_counts()
FT    618
PT     10
CT      5
FL      4
Name: employment_type, dtype: int64
sns.countplot(df, x='employment_type', palette='RdPu')
<Axes: xlabel='employment_type', ylabel='count'>
Here I will assume:
We can see from the above information that the vast majority of employees are employed on a "Full Time" basis. The chart makes clear that the distribution is heavily imbalanced, which raises the question of whether the few rows that are not "FT" should remain or be removed. Given that we may wish to consider further the employment types of careers in AI, it seems sensible to retain the 19 rows that are not "FT".
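If the non-"FT" rows ever needed to be excluded, a boolean filter would do it; a minimal sketch on hypothetical rows (in the real dataset only 19 of 637 rows are not 'FT'):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset
toy = pd.DataFrame({'employment_type': ['FT', 'FT', 'PT', 'CT'],
                    'salary_in_usd': [100000, 120000, 40000, 60000]})

# Keep full-time rows only
ft_only = toy[toy['employment_type'] == 'FT']
print(len(ft_only))  # 2
```

For now, no such filter is applied, since the minority employment types may themselves be of interest.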
#to view full list
pd.set_option('display.max_rows', None)
df['job_title'].value_counts()
Data Scientist                              148
Data Engineer                               139
Data Analyst                                104
Machine Learning Engineer                    52
Research Scientist                           16
Data Science Manager                         15
Data Architect                               11
Machine Learning Scientist                    9
AI Scientist                                  8
Big Data Engineer                             8
Data Science Consultant                       7
Data Analytics Manager                        7
Director of Data Science                      7
Principal Data Scientist                      7
BI Data Analyst                               6
Lead Data Engineer                            6
Computer Vision Engineer                      6
Data Engineering Manager                      5
Applied Data Scientist                        5
Head of Data                                  5
Business Data Analyst                         5
Machine Learning Developer                    4
Analytics Engineer                            4
Applied Machine Learning Scientist            4
Head of Data Science                          4
Data Analytics Engineer                       4
Machine Learning Infrastructure Engineer      3
Data Science Engineer                         3
Lead Data Analyst                             3
Principal Data Engineer                       3
Computer Vision Software Engineer             3
Lead Data Scientist                           3
Director of Data Engineering                  2
Product Data Analyst                          2
Principal Data Analyst                        2
ETL Developer                                 2
Financial Data Analyst                        2
Cloud Data Engineer                           2
Data Analytics Lead                           1
Finance Data Analyst                          1
Staff Data Scientist                          1
Big Data Architect                            1
3D Computer Vision Researcher                 1
Marketing Data Analyst                        1
NLP Engineer                                  1
Machine Learning Manager                      1
Head of Machine Learning                      1
Lead Machine Learning Engineer                1
Data Specialist                               1
Name: job_title, dtype: int64
From the information above I would argue that it is not entirely suitable to visualise the distribution of 'job_title', as there are many values that occur only once or twice, suggesting the variable may be flat and wide. However, it may be useful to visualise the 10 most common jobs in AI. To do this, we can create a new Series containing this information. This approach visualises an answer to the question: what is the most common job in AI?
def getTop10Jobs():
    top_10_jobs = df['job_title'].value_counts()[0:10] #slice the 10 most common titles
    return top_10_jobs
top_10_jobs = getTop10Jobs()
top_10_jobs
Data Scientist                148
Data Engineer                 139
Data Analyst                  104
Machine Learning Engineer      52
Research Scientist             16
Data Science Manager           15
Data Architect                 11
Machine Learning Scientist      9
AI Scientist                    8
Big Data Engineer               8
Name: job_title, dtype: int64
def displayTop10Jobs(top_10):
    top_10.plot(kind='barh', color='plum') #plot the Series passed in as a parameter
displayTop10Jobs(top_10_jobs)
Evidently, 'Data Scientist' is the most common job in AI according to this dataset. This chart simply shows the number of times these job titles appear in the dataset. Most values are one of 'Machine Learning Engineer', 'Data Analyst', 'Data Engineer', or 'Data Scientist'.
In addition, we can easily address another question outlined earlier by making use of Python's statistics library. We have already identified the highest salary as 600,000 USD. To calculate the average salary of jobs in AI explicitly, we can invoke the mean() function on the 'salary_in_usd' column (using 'salary' would average values across different currencies).
import statistics
print("Average salary of jobs in AI (in USD):", int(statistics.mean(df['salary_in_usd'])))
Average salary of jobs in AI (in USD): 113275
It is useful to know where the majority of jobs in AI are located. As such, I inspect the variables 'employee_residence' and 'company_location' more closely below.
df['employee_residence'].value_counts()
US    354
GB     46
IN     30
CA     29
DE     26
FR     18
ES     15
GR     13
PT      7
JP      7
PK      6
BR      6
NL      5
IT      4
RU      4
AU      4
PL      4
VN      3
TR      3
SG      3
AT      3
AE      3
HU      2
NG      2
BE      2
RO      2
DK      2
MX      2
SI      2
CN      1
HK      1
MD      1
JE      1
LU      1
IR      1
NZ      1
HR      1
CO      1
KE      1
RS      1
PR      1
CL      1
EE      1
EG      1
MY      1
PH      1
BG      1
UA      1
IQ      1
ID      1
DZ      1
AR      1
CH      1
CZ      1
TN      1
HN      1
BO      1
IE      1
MT      1
Name: employee_residence, dtype: int64
This is a similar situation to the distribution of 'job_title': many 'employee_residence' values occur only once or twice, suggesting a flat and wide distribution. As such, I will visualise only the 15 most common 'employee_residence' values.
#store 15 most common employee_residences
top_15 = df['employee_residence'].value_counts()[0:15]
top_15
US    354
GB     46
IN     30
CA     29
DE     26
FR     18
ES     15
GR     13
PT      7
JP      7
PK      6
BR      6
NL      5
IT      4
RU      4
Name: employee_residence, dtype: int64
top_15.plot(kind='barh', color='plum')
<Axes: >
The chart above shows that the vast majority of employees reside in the US.
df['company_location'].value_counts()
US    377
GB     49
CA     30
DE     29
IN     24
FR     15
ES     14
GR     11
JP      6
PT      5
AT      4
AU      4
PL      4
NL      4
BR      3
PK      3
MX      3
LU      3
AE      3
DK      3
TR      3
RU      2
BE      2
NG      2
CN      2
SG      2
CH      2
CZ      2
SI      2
IT      2
KE      1
IL      1
HR      1
AS      1
VN      1
MD      1
CL      1
IR      1
NZ      1
CO      1
EG      1
HU      1
HN      1
ID      1
IE      1
UA      1
MY      1
IQ      1
RO      1
EE      1
DZ      1
MT      1
Name: company_location, dtype: int64
#store 15 most common company_locations
top_15_company_locations = df['company_location'].value_counts()[0:15]
top_15_company_locations
US    377
GB     49
CA     30
DE     29
IN     24
FR     15
ES     14
GR     11
JP      6
PT      5
AT      4
AU      4
PL      4
NL      4
BR      3
Name: company_location, dtype: int64
top_15_company_locations.plot(kind='barh', color='purple')
<Axes: >
The chart above shows that the companies employing these employees are mostly located in the US.
df['company_size'].value_counts()
M    346
L    207
S     84
Name: company_size, dtype: int64
sns.countplot(df, x='company_size', palette='RdPu')
<Axes: xlabel='company_size', ylabel='count'>
Here I assume:
We can see from the chart above that 'company_size' takes all three values in substantial numbers, though the distribution is skewed towards medium-sized companies. In addressing the question where are the majority of jobs in AI?, this visualisation answers it in part: if "where" refers to companies rather than locations, most employees work in a medium-sized company. A high number of employees also work in large companies, while fewer than 100 work in small companies.
We explored above where the majority of jobs in AI are located, but did not consider employees who work remotely. The distribution of 'remote_ratio' is presented below.
df['remote_ratio'].value_counts()
100    401
0      135
50     101
Name: remote_ratio, dtype: int64
sns.countplot(df, x='remote_ratio', palette='RdPu')
<Axes: xlabel='remote_ratio', ylabel='count'>
Here I will assume:
We can see that a large number of employees work fully remotely. The numbers of employees who do not work remotely at all and who work remotely some of the time are roughly similar.
So far we have considered single variables on their own. Before moving on to Part 4: Data Visualisation, an important part of EDA is evaluating relationships/correlations between pairs of variables. For example, we have seen that most employees reside in the US, and that most companies are located in the US. A correlation matrix is an extremely useful tool for assessing the strength of relationships between pairs of numerical variables; categorical pairs, such as residence and location, require other techniques.
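Since corr() operates only on numerical columns, the relationship between the categorical 'employee_residence' and 'company_location' can instead be examined with a cross-tabulation. A minimal sketch on hypothetical values (not the real dataset):

```python
import pandas as pd

# Hypothetical residence/location pairs for illustration only
toy = pd.DataFrame({'employee_residence': ['US', 'US', 'GB', 'US'],
                    'company_location':   ['US', 'US', 'GB', 'GB']})

# Count how often each residence coincides with each company location
ct = pd.crosstab(toy['employee_residence'], toy['company_location'])
print(ct)
```

A large diagonal in such a table would suggest employees typically reside where their company is located.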
One major advantage of using Python to analyse data is the support offered by its libraries, which make implementing the correlation matrix significantly simpler and less time consuming. We can generate a correlation matrix by invoking the corr() function, and we can visualise it through the seaborn and matplotlib libraries. This is an effective technique: it is not only easy to implement, but also easy for stakeholders and analysts alike to interpret:
#the plain correlation matrix (numeric columns only, to avoid a FutureWarning)
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix
| | employee_id | work_year | salary | salary_in_usd | remote_ratio | monthly_salary |
|---|---|---|---|---|---|---|
| employee_id | 1.000000 | -0.772597 | 0.106603 | -0.165687 | -0.006387 | 0.106603 |
| work_year | -0.772597 | 1.000000 | -0.089774 | 0.180503 | 0.068606 | -0.089774 |
| salary | 0.106603 | -0.089774 | 1.000000 | -0.081715 | -0.013806 | 1.000000 |
| salary_in_usd | -0.165687 | 0.180503 | -0.081715 | 1.000000 | 0.131238 | -0.081714 |
| remote_ratio | -0.006387 | 0.068606 | -0.013806 | 0.131238 | 1.000000 | -0.013806 |
| monthly_salary | 0.106603 | -0.089774 | 1.000000 | -0.081714 | -0.013806 | 1.000000 |
In visualising the correlation matrix, the code below was adapted from W3Schools [4].
import matplotlib.pyplot as plt
#visualise the correlation matrix
#following code abstracted from W3Schools
axis_corr = sns.heatmap(
correlation_matrix, #use our correlation_matrix
vmin = -1, vmax = 1, center = 0, #set min to -1, max to 1, center to 0
#cmap = sns.diverging_palette(50, 500, n=500), #define colours
cmap = sns.color_palette('RdPu'), #define colours
square = True #show me the squares
)
plt.show()
We have a lot of information gathered now about our dataset and have found answers to the questions outlined at the start of the notebook. We can now go further to graphically represent any trends and patterns in the dataset, and to visualise how variables relate to one another.
The box plot below is effective in visualising the distribution of salaries across the three-year interval. It allows users to easily compare statistics like the median of each group (in this case, each year). It also neatly presents outliers as dots outside the box.
So far, this project has made use of count plots, bar charts, and a heatmap to visualise the salary data. The box plot can describe basic distributions; however, it is also worth exploring a violin plot, which effectively presents the relationship between two variables. The violin plot is essentially a mix of the box plot and the kernel density plot, so it comes with added advantages: it presents summary statistics alongside each variable's density.
Below, there are two violin plots. Each uses and presents the same data, with one utilising Seaborn and the other Plotly. The violin plot created with Seaborn is, to me, easier to look at as it shows very clearly the median as a white dot, along with the interquartile range and 1.5x interquartile range through the box plot inside the violin. The violin plot produced using Plotly is more engaging, as a user can hover over it and inspect the data further.
import plotly.express as px
sns.boxplot(x='work_year', y='salary_in_usd', data=df, palette='RdPu')
<Axes: xlabel='work_year', ylabel='salary_in_usd'>
fig = px.box(df, x='work_year', y='salary_in_usd')
fig.update_traces(line_color='purple')
fig.show()
def salByYearViolin():
    f, ax = plt.subplots(figsize=(8, 8))
    # Show each distribution with both violins and points
    sns.violinplot(x='work_year', y="salary_in_usd", data=df, inner="box", cut=2, linewidth=3, palette='RdPu')
    sns.despine(left=True)
    f.suptitle('Salary by Year', fontsize=18, fontweight='bold')
    ax.set_xlabel("Year", size=16, alpha=0.7)
    ax.set_ylabel("Salary in USD", size=16, alpha=0.7)
salByYearViolin()
def salByYearViolinHover():
    fig = px.violin(df, y='salary_in_usd', x='work_year')
    fig.update_layout(
        title="Salaries by Year",
        yaxis_title="Salaries (USD)",
        xaxis_title="Year"
    )
    fig.update_traces(line_color='purple')
    fig.show()
salByYearViolinHover()
As a student, it is always interesting to find out how salaries can increase with more experience. We can visualise this by plotting salaries against experience_level.
sns.boxplot(x='experience_level', y='salary_in_usd', data=df, palette='RdPu')
def salByExpLevelBox():
    fig = px.box(df, x='experience_level', y='salary_in_usd')
    fig.update_traces(line_color='purple')
    fig.show()

salByExpLevelBox()
plt.figure(figsize=(15,9))
sns.kdeplot(data=df, x='salary_in_usd', hue='experience_level', fill=False, linewidth=5)
plt.title("Distribution of Salary by Experience Level", fontsize=20)
plt.xlabel("Salary (in USD)", fontsize=18)
plt.ylabel("Density", fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.show()
The kernel density plot above shows, as expected, that more experienced employees earn significantly more than those with less experience. The green line indicates that Senior employees in AI earn the highest salaries. What is surprising in this kernel density plot, and perhaps very misleading, is that employees of experience level "EX" (which I have assumed is one of the highest levels of experience) appear to earn a lower salary overall. However, this can be explained if we recall the variable distribution of 'experience_level': the dataset contains only 26 rows with the value "EX", significantly fewer than for the other levels, so the density estimate for that group is unreliable. The kernel density plot above is therefore perhaps not a completely true reflection of reality.
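A quick `value_counts()` check makes this kind of imbalance easy to spot before trusting a density plot. The sketch below uses a tiny made-up sample (the column names mirror the dataset, the values do not):

```python
import pandas as pd

# Hypothetical mini-sample mirroring the dataset's 'experience_level' column
sample = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "SE", "EX", "SE", "MI"],
    "salary_in_usd": [50000, 80000, 140000, 150000, 200000, 135000, 90000],
})

# value_counts() reveals how thinly each level is represented,
# which explains why sparse groups (like "EX") can mislead a KDE
counts = sample["experience_level"].value_counts()
print(counts)
```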
It is encouraging to see that the median salary for experience_level "EN" is a tidy sum of 56.36k. It would also be interesting to find out the average salary for each job. To investigate this, a pivot table can help: it is a significantly less time-consuming way to compute an aggregate, such as the mean, for each group. The result can then be displayed nicely using a bar chart, made more user-friendly and readable by formatting labels horizontally so that users can glance at the plot and quickly locate information.
def createSalsByJobPivotTable():
    # Set index to job_title and use salary_in_usd as values;
    # aggfunc='mean' gives each job's average salary
    salaries_by_job_pivot_table = pd.pivot_table(
        data=df, index=['job_title'], values=['salary_in_usd'],
        aggfunc='mean').sort_values(by=['salary_in_usd'], ascending=False)
    return salaries_by_job_pivot_table

salaries_by_job_pivot_table = createSalsByJobPivotTable()
salaries_by_job_pivot_table  # For testing purposes
| job_title | salary_in_usd |
|---|---|
| Data Analytics Lead | 405000.000000 |
| Principal Data Engineer | 328333.333333 |
| Financial Data Analyst | 275000.000000 |
| Principal Data Scientist | 215116.285714 |
| Director of Data Science | 195027.000000 |
| Data Architect | 177873.909091 |
| Applied Data Scientist | 175655.000000 |
| Analytics Engineer | 175000.000000 |
| Data Science Manager | 169252.866667 |
| Data Specialist | 165000.000000 |
| Head of Data | 160126.800000 |
| Director of Data Engineering | 156738.000000 |
| Head of Data Science | 146718.750000 |
| Machine Learning Scientist | 143344.444444 |
| Applied Machine Learning Scientist | 142025.500000 |
| Lead Data Engineer | 139691.666667 |
| Data Analytics Manager | 127134.285714 |
| Cloud Data Engineer | 124647.000000 |
| Data Engineering Manager | 123227.200000 |
| Principal Data Analyst | 122500.000000 |
| Machine Learning Manager | 117104.000000 |
| Lead Data Scientist | 115190.000000 |
| Data Engineer | 113119.223022 |
| Machine Learning Engineer | 111893.230769 |
| Data Scientist | 109480.628378 |
| Machine Learning Developer | 109330.000000 |
| Research Scientist | 108965.812500 |
| Computer Vision Software Engineer | 105248.666667 |
| Staff Data Scientist | 105000.000000 |
| Machine Learning Infrastructure Engineer | 101039.333333 |
| Big Data Architect | 99703.000000 |
| Lead Data Analyst | 92203.000000 |
| Data Analyst | 92077.701923 |
| Marketing Data Analyst | 88654.000000 |
| Lead Machine Learning Engineer | 87454.000000 |
| AI Scientist | 82868.625000 |
| Head of Machine Learning | 78747.000000 |
| Business Data Analyst | 76654.000000 |
| Data Science Engineer | 75803.333333 |
| BI Data Analyst | 74755.166667 |
| Data Science Consultant | 69420.714286 |
| Data Analytics Engineer | 64799.250000 |
| Finance Data Analyst | 61896.000000 |
| ETL Developer | 54659.000000 |
| Big Data Engineer | 51974.000000 |
| Computer Vision Engineer | 44419.333333 |
| NLP Engineer | 37047.000000 |
| Product Data Analyst | 13036.000000 |
| 3D Computer Vision Researcher | 5409.000000 |
def displayTop20AvgSals(pivot_table):
    # Supply argument 20 to head() to display only the top 20 average salaries
    avg_sals = pivot_table.head(20).plot(kind='barh', color='plum')
    avg_sals.tick_params(axis='x', labelrotation=45)

displayTop20AvgSals(salaries_by_job_pivot_table)
def displayWorst20AvgSals(pivot_table):
    # Supply argument 20 to tail() to display only the bottom 20 average salaries
    avg_sals = pivot_table.tail(20).plot(kind='barh', color='purple')
    avg_sals.tick_params(axis='x', labelrotation=45)

displayWorst20AvgSals(salaries_by_job_pivot_table)
Now, we can see clearly that the highest average salary belongs to 'Data Analytics Lead'. The lowest average salary belongs to '3D Computer Vision Researcher'.
For many entry-level positions, remote work is not always an option, so where a company is based matters. I would like to inspect the salaries according to company location. To do this, I will make use of a bar plot where the x-axis represents the country and the y-axis represents the average salary. We must first extract a subset of data from the DataFrame: group the dataset by 'company_location', compute the mean 'salary_in_usd' for each group, and then sort the data from highest to lowest.
def getAvgSalsByCompLocation():
    # Group by company_location, average the salaries, and sort high to low
    avg_sals_comp_location = df.groupby('company_location')[['salary_in_usd']].mean().sort_values(
        'salary_in_usd', ascending=False)
    return avg_sals_comp_location

def displayAvgSalsByCompLocation(avg_sals_comp_location):
    top_25 = avg_sals_comp_location.head(25)
    fig = px.bar(x=top_25.index,
                 y=top_25.salary_in_usd,
                 color=top_25.salary_in_usd,
                 color_continuous_scale=px.colors.sequential.Purp)
    fig.update_layout(
        title="Top 25 Average Salaries by Company Location",
        yaxis_title="Salary (USD)",
        xaxis_title="Country Code"
    )
    fig.show()
avg_sals_comp_location = getAvgSalsByCompLocation()
avg_sals_comp_location
| company_location | salary_in_usd |
|---|---|
| RU | 157500.000000 |
| US | 144988.458886 |
| NZ | 125000.000000 |
| IL | 119059.000000 |
| AU | 115558.750000 |
| JP | 114127.333333 |
| DZ | 100000.000000 |
| IQ | 100000.000000 |
| AE | 100000.000000 |
| CA | 99786.800000 |
| BE | 85699.000000 |
| DE | 81334.965517 |
| GB | 81308.857143 |
| SG | 77622.000000 |
| AT | 72832.750000 |
| CN | 71665.500000 |
| IE | 71056.000000 |
| PL | 66028.000000 |
| FR | 63912.200000 |
| CH | 63834.500000 |
| SI | 63831.000000 |
| RO | 60000.000000 |
| NL | 54860.750000 |
| DK | 54386.333333 |
| ES | 52921.571429 |
| GR | 52062.545455 |
| CZ | 50850.500000 |
| HR | 45618.000000 |
| LU | 43942.666667 |
| PT | 42709.400000 |
| CL | 40038.000000 |
| MY | 40000.000000 |
| IT | 36366.500000 |
| HU | 35735.000000 |
| EE | 32795.000000 |
| MX | 32123.333333 |
| NG | 30000.000000 |
| IN | 28559.041667 |
| MT | 28369.000000 |
| EG | 22800.000000 |
| CO | 21844.000000 |
| TR | 20096.666667 |
| HN | 20000.000000 |
| BR | 18602.666667 |
| AS | 18053.000000 |
| MD | 18000.000000 |
| ID | 15000.000000 |
| UA | 13400.000000 |
| PK | 13333.333333 |
| KE | 9272.000000 |
| IR | 4000.000000 |
| VN | 4000.000000 |
displayAvgSalsByCompLocation(avg_sals_comp_location)
But money is not always the top priority for many employees. For some, being able to work in one's home environment is important to their work-life balance. It would be useful to be able to see if the potential for home-working is realistic or not.
#for readability
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote work', 50:'Some remote work', 100:'Fully remote'})
def displayRemoteWorkingByYear():
    fig = px.histogram(df, x='remote_ratio', color='work_year', barmode='group',
                       category_orders={
                           'remote_ratio': ['No remote work', 'Some remote work', 'Fully remote'],
                           'work_year': [2020, 2021, 2022]
                       },
                       text_auto=True,  # display values in bars
                       color_discrete_sequence=['rebeccapurple', 'plum', 'darkorchid']  # bar colours
                       )
    fig.update_layout(
        title="Remote Working 2020-2022",
        yaxis_title="Number of Employees",
        xaxis_title="Remote Ratio"
    )
    fig.show()
displayRemoteWorkingByYear()
!pip install wordcloud
# Code to construct a word cloud adapted from GeeksforGeeks [5]
from wordcloud import WordCloud, STOPWORDS
from collections import Counter

def buildWordCloud():
    # Store job titles and their frequencies in a Counter
    words = Counter(df.job_title)
    # Configure the word cloud
    wordcloud = WordCloud(width=5000, height=3500, background_color='black', colormap='RdPu',
                          collocations=False,  # exclude collocations of two words
                          stopwords=STOPWORDS  # remove common words such as conjunctions and pronouns
                          )
    # Create the word cloud from job titles and their frequencies
    wordcloud.generate_from_frequencies(words)
    return wordcloud
def displayWordcloud():
    wordcloud = buildWordCloud()
    # Set overall figure size
    plt.figure(figsize=(25, 20))
    # Don't display axes
    plt.axis("off")
    # imshow() renders the word cloud data as an image
    plt.imshow(wordcloud)
    plt.show()
Whether you are in your senior years of high school, just starting out in your career, or thinking of taking a completely new direction, there's a job in AI for you. Read on to find out why you should consider a career in the AI industry.
It's no secret that technology has grown exponentially in recent decades. For many, Artificial Intelligence is a mystery, but it's been around for longer than some think. AI has become far more prominent recently as computing equipment is far less expensive than it was in the 20th Century, and machines can now remember significantly more information than they could in the 1950s. You might not realise it, but you're probably interacting with AI every day - from self-serve checkouts to chats with Alexa, and from fraud detection to 3D printed prostheses. The bottom line is, AI career possibilities are endless. The number of jobs in AI is increasing rapidly, and there's no sign of this stopping. We need all sorts of people to keep the world running, and these are just some of the jobs that you could be doing in AI:
displayWordcloud()
Working in AI isn't just for coders sitting in a dark room for hours on end. It's a diverse field, and there's probably something for everyone.
It's important you're rewarded for your hard work, and a decent income will give you financial security. From our data, we've calculated the highest salary to be 600,000 USD and the average to be 113,275 USD. And that's just the average across the whole three-year period.
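Headline figures like these come straight from pandas' `max()` and `mean()`. A minimal sketch with made-up sample values (the real figures come from the full dataset):

```python
import pandas as pd

# Hypothetical salary values standing in for the dataset's salary_in_usd column
salaries = pd.Series([60000, 120000, 600000, 113000, 90000])

# max() gives the single highest salary; mean() the average across all rows
highest = salaries.max()
average = salaries.mean()
print(highest, average)
```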
salByYearViolin()
Look closely at the chart. The white dot represents the median salary for that year. Take a closer look below by hovering over the chart.
salByYearViolinHover()
The median salary rises from 74.13k in 2020 to 82.528k in 2021, and then to 120.16k in 2022. The same can't be said for all industries. Even if you're just starting out in AI, entry-level positions offer a very good pay package:
salByExpLevelBox()
What this shows you is that the median salary for entry-level positions is 56.36k. And the more experienced you become, the higher the salary you will likely receive: with around 5-10 years of experience working in AI, the median rises to 76.522k.
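Figures like these can be checked with a quick groupby on the median, which resists the outliers that can distort a mean in small salary samples. A sketch using a hypothetical mini-sample (column names match the dataset, the numbers do not):

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the salaries dataset
sample = pd.DataFrame({
    "experience_level": ["EN", "EN", "MI", "MI", "SE"],
    "salary_in_usd": [50000, 62000, 70000, 83000, 150000],
})

# Median salary per experience level
medians = sample.groupby("experience_level")["salary_in_usd"].median()
print(medians)
```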
Hopefully, you can see that careers in AI are financially rewarding. But with so many job titles being thrown around, how do we know which jobs pay the most? Look below: these are the top 20 average salaries by job.
displayTop20AvgSals(salaries_by_job_pivot_table)
You'll see that some of the highest average salaries belong to jobs in data science. But how many opportunities are there for these particular jobs? Is demand high?
displayTop10Jobs(top_10_jobs)
Indeed, most jobs are in Data Science too. But maybe money isn't your top concern? If you get fed up sitting in an office all the time, there's opportunity to travel the world and sit in an office in another country.
Do you want to travel the world as part of your job? Then AI gives you that opportunity. Even if the money isn't that important to you, you'll need a decent income to support yourself wherever you are. In AI, you get the best of both worlds - good pay and a chance to see the world.
displayAvgSalsByCompLocation(avg_sals_comp_location)
Maybe Russia's not high on your list of countries to visit, but it's clear that the average salary is high in all of the countries in the chart. This isn't surprising given their hefty investment in AI. In fact, Russia's AI budget is thought to be approximately 4 billion rubles, equivalent to approximately 60.2 million USD [6]. Despite its war with Ukraine, President Vladimir Putin is determined to prevent Western countries from obtaining a monopoly over AI. It's widely reported that the US and China are at the forefront of AI development, and many believe this development is going to "transform the world and revolutionise society in a way similar to the introduction of computers in the 20th century" [7].
So if you'd prefer a life in the sun in Spain, or a move to Denmark, you can expect an average salary of 52.92k or 54.39k respectively, which, although at the lower end, is still not to be sniffed at.
Remote working's absolutely an option in AI. In fact, the Covid-19 pandemic has resulted in huge changes in how people work. Yes, the companies paying the higher average salaries might be located in Russia, the US, and New Zealand. But the data shows us that remote working is on an upward trend.
displayRemoteWorkingByYear()
It's striking that in 2022 our data shows 247 employees working on a fully remote basis, up from just 37 in 2020. This is very likely a result of the pandemic: when it first took hold of the world, many companies were not ready to move to fully remote working, but by 2022 they had had time to prepare, hence the clear year-on-year increase (37 in 2020, 117 in 2021, 247 in 2022).
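The counts the grouped histogram displays can also be produced directly with `pd.crosstab`. A minimal sketch with invented rows (the column names and remote-work labels mirror the dataset, the counts do not):

```python
import pandas as pd

# Hypothetical rows mimicking 'work_year' and the relabelled 'remote_ratio'
sample = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022, 2022],
    "remote_ratio": ["Fully remote", "No remote work", "Fully remote",
                     "Some remote work", "Fully remote", "Fully remote",
                     "No remote work"],
})

# crosstab gives the same year-by-category counts the histogram shows,
# with missing combinations filled in as zero
table = pd.crosstab(sample["work_year"], sample["remote_ratio"])
print(table)
```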
So whether you're a homebird or desperate to move country, whether you're bothered about money or not, there's plenty of opportunities in AI.
The analysis undertaken in this project presented various challenges, both on a personal level and on a programmatic level.
The most notable challenge occurred early in the project. Initially, a dataset pertaining to road accidents in the UK was selected (as mentioned in Part 1). This was significantly larger than the AI salaries dataset, containing three CSV files with approximately 500,000 rows. All was going well until the data visualisation stage, where Jupyter Notebook slowed down considerably with each plot added and eventually froze completely; my laptop simply isn't powerful enough to work with a dataset of that size. Fortunately, I make an effort to manage my time well, so there was plenty of time to start afresh with a different, smaller dataset. Although very frustrating and disappointing (it was very interesting to work with real-world data), it has certainly been a learning experience: in future, I will consider the specification of my equipment when selecting datasets.
A positive side of this is that working with a smaller dataset has still shown what could be done with real-world datasets: for example, extracting data that meets a condition and displaying complex information in an organised, presentable manner. I have learned that data analysis can really guide decisions in the real world. If we can turn rows and rows of data into useful information, decision-makers can act accordingly: if data analysis and visualisation highlight an upward trend in road accidents, for instance, this could trigger safety campaigns and driving test reviews.
There have been challenges relating to programming too. At first, I could not get my head around how a pivot table worked, but after spending some time on it I can now see how beneficial pivot tables are in data visualisation. A single table that can quickly transform data in the desired way must be very powerful in real-life data science. For example, I imagine this would have been very useful at the peak of the Covid-19 pandemic, when analysts wanted to summarise large amounts of data that would later be presented to a diverse audience, many without scientific expertise.
Another learning curve of this assignment has been, quite simply, learning Python. I started my degree with absolutely no programming experience, and have primarily learned Java and SQL throughout years 1 and 2. The prospect of learning another language was quite daunting at first. However, this assignment has really shown me how powerful Python is. It is a language I would like to become more competent at using because it is clear how usable it is in the real world.
Overall, this has been an insightful assignment. Although it is disappointing I could not pursue this assignment using the road accident dataset, it has still presented valuable learning opportunities that have given me a deeper insight into real-world data science tasks.
[1] “datasets/salaries-ai-jobs-net.csv at master · plotly/datasets,” GitHub. https://github.com/plotly/datasets/blob/master/salaries-ai-jobs-net.csv (accessed November-December, 2023)
[2] M. Waskom, “seaborn.countplot — seaborn 0.9.0 documentation,” Pydata.org, 2012. https://seaborn.pydata.org/generated/seaborn.countplot.html
[3] “Visualizing the distribution of a dataset — seaborn 0.9.0 documentation,” Pydata.org, 2012. https://seaborn.pydata.org/tutorial/distributions.html
[4] “Data Science Statistics Correlation Matrix,” www.w3schools.com. https://www.w3schools.com/datascience/ds_stat_correlation_matrix.asp
[5] S. Kadam, “Generating Word Cloud in Python,” GeeksforGeeks, May 11, 2018. https://www.geeksforgeeks.org/generating-word-cloud-python/
[6] Samuel Bendett et al., “Artificial Intelligence, China, Russia, and the Global Order Technological, Political, Global, and Creative Perspectives,” Oct. 2019. https://www.jstor.org/stable/pdf/resrep19585.28.pdf
[7] G. Faulconbridge, "Putin says West cannot have AI monopoly so Russia must up its game," Reuters, Nov. 24, 2023. Available: https://www.reuters.com/technology/putin-approve-new-ai-strategy-calls-boost-supercomputers-2023-11-24/